Create a Manual Chinese Word Segmentation Dataset Using Crowdsourcing Method
Abstract
WordSegCHC 1.0 is a manual Chinese word segmentation dataset built through eight crowdsourcing tasks conducted on the Crowdflower platform. It contains manual word segmentation data for 152 Chinese sentences whose lengths range from 20 to 46 characters without punctuation. Each sentence received 200 segmentation responses in its corresponding crowdsourcing task, and the number of valid responses per sentence ranges from 123 to 143, so every sentence was segmented by more than 120 subjects. We also propose an evaluation measure, the manual segmentation error rate (MSER), to assess the dataset; the MSER of the dataset proved to be very low, which indicates reliable data quality. In this work, we applied the crowdsourcing method to the Chinese word segmentation task, and the results confirm again that crowdsourcing is a promising tool for linguistic data collection. The framework for crowdsourced linguistic data collection used here can be reused in similar tasks. To the best of our knowledge, the resulting dataset fills a gap in Chinese language resources, and it has potential applications in research on the word intuition of Chinese speakers and in Chinese language processing.
Similar resources
Text Window Denoising Autoencoder: Building Deep Architecture for Chinese Word Segmentation
Deep learning is the new frontier of machine learning research, which has led to many recent breakthroughs in English natural language processing. However, there are inherent differences between Chinese and English, and little work has been done to apply deep learning techniques to Chinese natural language processing. In this paper, we propose a deep neural network model: text window denoising ...
Exploring Mental Lexicon in an Efficient and Economic Way: Crowdsourcing Method for Linguistic Experiments
Mental lexicon plays a central role in human language competence and inspires the creation of new lexical resources. The traditional linguistic experiment method used to explore the mental lexicon has some disadvantages. Crowdsourcing has become a promising method for conducting linguistic experiments, one that enables us to explore the mental lexicon in an efficient and economic way. We focus on the fe...
Using Part-of-Speech Reranking to Improve Chinese Word Segmentation
Chinese word segmentation and Part-of-Speech (POS) tagging have commonly been considered two separate tasks. In this paper, we present a system that performs Chinese word segmentation and POS tagging simultaneously. We train a segmenter and a tagger model separately based on linear-chain Conditional Random Fields (CRF), using lexical, morphological and semantic features. We propose an approx...
Error analysis and confidence measure of Chinese word segmentation
Word segmentation of a Chinese sentence is essential for many applications in language and speech processing. No method achieves word segmentation without any errors. We propose a confidence measure for the segmentation result to cope with the problems caused by these errors. The effectiveness of the method depends mainly on the error analysis of the word segmentation. With the con...
Building a Semantic Transparency Dataset of Chinese Nominal Compounds: A Practice of Crowdsourcing Methodology
This paper describes work that aimed to create a semantic transparency dataset of Chinese nominal compounds (SemTransCNC 1.0) using crowdsourcing methodology. We first selected about 1,200 Chinese nominal compounds from a lexicon of modern Chinese and the Sinica Corpus. Then, through a series of crowdsourcing experiments conducted on the Crowdflower platform, we successfully collected both o...